AITopics | html document

Collaborating Authors

html document

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Neural Information Processing SystemsFeb-11-2026, 20:41:12 GMT

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs).

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Industry:

Law (1.00)
Information Technology (1.00)
Government (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Neural Information Processing SystemsOct-10-2025, 00:27:22 GMT

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs).

corpusid, dataset, semanticscholar, (17 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Industry:

Law (1.00)
Information Technology (1.00)
Government (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

Wang, Feng, Shi, Zesheng, Wang, Bo, Wang, Nan, Xiao, Han

arXiv.org Artificial IntelligenceMar-2-2025

We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20\% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.

content extraction, dataset, extraction, (14 more...)

arXiv.org Artificial Intelligence

2503.01151

Country:

North America > United States (0.14)
Europe > Netherlands > North Brabant > Eindhoven (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(6 more...)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

Add feedback

MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering

DeHaven, Mitchell

arXiv.org Artificial IntelligenceSep-4-2024

In this paper we present a multi-adapter retrieval augmented generation system (MARAGS) for Meta's Comprehensive RAG (CRAG) competition for KDD CUP 2024. CRAG is a question answering dataset contains 3 different subtasks aimed at realistic question and answering RAG related tasks, with a diverse set of question topics, question types, time dynamic answers, and questions featuring entities of varying popularity. Our system follows a standard setup for web based RAG, which uses processed web pages to provide context for an LLM to produce generations, while also querying API endpoints for additional information. MARAGS also utilizes multiple different adapters to solve the various requirements for these tasks with a standard cross-encoder model for ranking candidate passages relevant for answering the question. Our system achieved 2nd place for Task 1 as well as 3rd place on Task 2.

hallucination, information, wang, (14 more...)

arXiv.org Artificial Intelligence

2409.03171

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > Texas > Bexar County > San Antonio (0.04)
(5 more...)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Awadalla, Anas, Xue, Le, Lo, Oscar, Shu, Manli, Lee, Hannah, Guha, Etash Kumar, Jordan, Matt, Shen, Sheng, Awadalla, Mohamed, Savarese, Silvio, Xiong, Caiming, Xu, Ran, Choi, Yejin, Schmidt, Ludwig

arXiv.org Artificial IntelligenceJun-17-2024

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises of one trillion text tokens and three billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS.

dataset, mint-1t, semanticscholar, (17 more...)

arXiv.org Artificial Intelligence

2406.11271

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(6 more...)

Genre: Research Report (0.82)

Industry:

Law (1.00)
Information Technology (1.00)
Government (1.00)

Technology:

Information Technology > Software (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

Add feedback

Dual-View Visual Contextualization for Web Navigation

Kil, Jihyung, Song, Chan Hee, Zheng, Boyuan, Deng, Xiang, Su, Yu, Chao, Wei-Lun

arXiv.org Artificial IntelligenceFeb-6-2024

Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces (i.e., actionable elements and operations) of webpages. Nevertheless, HTML documents may not provide a clear task-related context for each element, making it hard to select the right (sequence of) actions. In this paper, we propose to contextualize HTML elements through their "dual views" in webpage screenshots: each HTML element has its corresponding bounding box and visual content in the screenshot. We build upon the insight -- web developers tend to arrange task-related elements nearby on webpages to enhance user experiences -- and propose to contextualize each element with its neighbor elements, using both textual and visual features. The resulting representations of HTML elements are more informative for the agent to take action. We validate our method on the recently released Mind2Web dataset, which features diverse navigation domains and tasks on real-world websites. Our method consistently outperforms the baseline in all the scenarios, including cross-task, cross-website, and cross-domain ones.

html document, neighbor, ual -vcr vnei, (14 more...)

arXiv.org Artificial Intelligence

2402.04476

Country:

North America > Canada > Ontario > Toronto (0.05)
North America > United States > New York (0.05)
North America > United States > Ohio (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Paster, Keiran, Santos, Marco Dos, Azerbayev, Zhangir, Ba, Jimmy

arXiv.org Artificial IntelligenceOct-10-2023

There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitive web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.

dataset, language model, openwebmath, (16 more...)

arXiv.org Artificial Intelligence

2310.06786

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Gur, Izzeddin, Furuta, Hiroki, Huang, Austin, Safdari, Mustafa, Matsuo, Yutaka, Eck, Douglas, Faust, Aleksandra

arXiv.org Artificial IntelligenceOct-2-2023

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.

arxiv preprint arxiv, language model, website, (12 more...)

arXiv.org Artificial Intelligence

2307.12856

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(22 more...)

Genre: Research Report (0.63)

Industry:

Leisure & Entertainment (0.46)
Information Technology (0.46)
Banking & Finance > Real Estate (0.33)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Optimal Scraping Technique: CSS Selector, XPath, & RegEx - DataScienceCentral.com

#artificialintelligenceMar-1-2022, 15:23:59 GMT

In nearly all cases, what is required is a small sample from a very large file (e.g. Therefore, an essential part of scraping is searching through an HTML document and finding the correct information. How that should be done is the matter of some debate, preferences, experience, and types of data. While all scraping and parsing methods are "correct", some of them have benefits that may be vital when more optimization is required. Some methods may be easier for specific types of data.

css selector, regex, xpath, (13 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence (0.37)
Information Technology > Communications > Web (0.36)
Information Technology > Software (0.32)

Add feedback

Python -- from A to Z

#artificialintelligenceSep-6-2021, 06:11:41 GMT

Python is one of the leading programming languages across the globe. It is used in many contexts from data science, robotics, web development to gaming, and rapid prototyping. Its simple syntax makes Python programs really easy to read and write, ensuring a rapid learning curve. Additionally, Python has'batteries included' -- multiple libraries (standard and third-party) that will greatly facilitate your work as a programmer. Although a basic programming level can be reached faster than with other programming languages, it definitely takes some time to master Python.

anaconda, information, python, (14 more...)

#artificialintelligence

Country: Europe > Spain > Galicia > Madrid (0.05)

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Software > Programming Languages (0.93)
Information Technology > Communications (0.75)

Add feedback